Abstract: In this paper we propose a graph-based data clustering
algorithm based on the exact clustering of a minimum spanning tree
with respect to a minimum isoperimetry criterion. We show that our
basic clustering algorithm runs in $O(n \log n)$ time, and with
post-processing in $O(n^2)$ (worst case) time, where $n$ is the
size of the data set. We also show that our generalized graph
model, which allows the use of potentials at vertices, can be used
to extract more detailed information, namely the outlier profile of
the data set. In this direction we show that our approach can be
used to define the concept of an outlier-set in a precise way, and
we propose approximation algorithms for finding such sets. We also
provide a comparative performance analysis of our algorithm against
related ones and show that the new clustering algorithm (without
the outlier extraction procedure) behaves quite effectively even on
hard benchmarks and handmade examples.

Index Terms: isoperimetric constant, Cheeger constant, normalized
cut, graph partitioning, perceptual grouping, data clustering,
outlier detection.



I. Introduction 
A. A concise survey of main results 

Data clustering, the unsupervised grouping of similar patterns into
clusters, is a central problem in engineering disciplines and
applied sciences that is constantly under theoretical and practical
development and verification. In this article we are concerned with
graph-based data clustering methods, which have been extensively
studied and developed, mainly because of their simple
implementation and acceptable efficiency, in a number of different
fields such as signal and image processing, computer vision,
computational biology, machine learning and networking, to name a
few.

The main contribution in this article can be described as a 
general graph-based data clustering algorithm which falls into 
the category of such algorithms that use a properly defined 
sparsest cut problem as the clustering criteria. In this regard, 
it is instructive to note some highlights of our approach before 
we delve into the details in subsequent sections (details of 
our approach as well as a survey of related contributions will 
appear in the second part of this introduction). 

It has been already verified that graph-based clustering 
methods that operate in terms of non-normalized cuts are not 
suitable for general data clustering and behave poorly in comparison
to the normalized versions (e.g. see [29]). Moreover,
it is well known that there is a close relationship between the
minimizers of the normalized cut problem, spectral clustering
solutions, mixing rates of random walks, the minimizers of 
the K-means cost function, kernel PCA and low dimensional 
embedding, while the corresponding decision problems are 
known to be NP-complete in general (e.g. see yj, [3J-[5J, 
| fT9l , p4) and references therein). 

In this article, we provide an efficient clustering algorithm based
on a relaxation of the feasible space of solutions from the set of
partitions to the larger set of subpartitions (i.e. collections of
mutually disjoint subsets of the domain). From one point of view,
our algorithm can be considered a generalization of the Grady and
Schwartz approach [10], [11] based on isoperimetry problems, while
we rely extensively on the results of [5] and [4]. We also believe
that this relaxation, which moves from the space of partitions to
the space of subpartitions, not only makes the problem easier to
solve but is also consistent with the natural phenomenon of having
undesirable data or outliers. We will use this property to show
that our algorithm can be enhanced to a more advanced procedure
capable of presenting a hierarchy of data similarity profiles,
which can lead to the extraction of outliers.

In this regard, one may comment on some different aspects 
of this approach as follows. 

Theoretical aspects: From a theoretical point of view, it is
proved in [4] that the normalized cut criterion is not formally
well defined, in the sense that it does not admit a variational
description through a real-function relaxation of the problem
(i.e. it does not admit a Federer-Fleming type theorem).
However, for $k \ge 2$, the well-defined version, known as the
$k$-isoperimetry problem (defined in [4]), whose definition is
in terms of normalized-flow minimization on $k$-subpartitions,
actually admits such a relaxation. It should be noted that
although there are some approaches to clustering based on the
classical 2-isoperimetry (i.e. the Cheeger constant) on weighted
graphs (e.g. see [11]), it follows from the results of [4] that in
the classical model the difference between the cases of partitions
and subpartitions is only observable when $k \ge 3$. Consequently,
our approach is completely different in nature from iterative
2-partitioning or spectral approximation methods based on
eigenmaps already existing in the literature.

Also, somewhat surprisingly (see Theorem 3), it turns out that
a special version of the $k$-isoperimetry problem is efficiently
solvable for trees. This fact, along with the well-known approach
of finding an approximate graph partitioning through minimum
spanning trees, constitutes the core of our algorithm.

Practical aspects: There are different practical aspects of
the proposed algorithm that one may comment on. Firstly, the
run-time of the proposed approximation algorithm is almost linear
in the size of the input data, which provides an opportunity to
cluster large data sets. It should also be noted that our
algorithm for $k$-clustering obtains an exact optimal clustering
of a suitably chosen subtree in a global approach and does
not apply an iterative two-partitioning or an approximation
through eigenmaps. This is one of the reasons supporting a better
approximation by our algorithm compared to the other existing
ones. In this regard, we also present a number of experimental
results justifying a better performance of our algorithm in
practice (see Tables I, II and Section III).



Secondly, we should note that approximation through the 
isoperimetry criteria provides an extra piece of information 
as a (possibly nonempty) subset of the domain (since the 
union of subpartitions may not be a covering). This piece of 
information makes it possible to obtain the almost minimal 
clustering as well as to extract deviated data and outliers, at the 
same time. In order to handle this extra information, we have 
generalized our graph model to the case of a weighted graph 
with potential. This generalization of the graph representation 
model is another original aspect of our contribution where we 
rely on results of [5] and [4] in this more general setting (see
Theorem 3). It is interesting to note that in this more general
setting our results presented in Section IV show that not only
can we handle outlier extraction and clustering at the same time,
but the new setup also makes it possible to extract outliers even
in the case of 2-clusterings (which is theoretically meaningless
by definition when one is using the classical isoperimetry or
2-normalized cuts). Using this setting we propose a formal
definition for the outlier profile of a data set and, moreover, we
provide a couple of examples to study the efficiency of the
proposed method in extracting outliers.

B. Background and related contributions 

Unsupervised grouping of data based on a predefined sim- 
ilarity criteria is usually referred to as data clustering in 
general, where in some more specific applications one may 
encounter some other terms as segmentation in image pro- 
cessing or grouping in data mining. Based on its importance 
and applicability, there exists a very vast literature related to 
this subject (e.g. see |J6|, p3j for some general background), 
however, in this article we are mainly concerned with cluster- 
ing algorithms that rely on a representation of data as a simple 
weighted graph in which the edge-weights are tuned, using a 
predefined similarity measure (e.g. see Section III] ||28J and 
references therein). 

Graph-based data clustering is usually reduced to the graph 
partitioning problem on the corresponding weighted graph 
which is also well-studied in the literature. To this end, it 
is instructive to note that from this point of view and if one 
considers a weighted graph as a geometric object, then the 
partitioning problem can be linked to a couple of very central 
and extensively studied problems in geometry as isoperimetry 
problem, concentration of measure and estimation of diffusion 
rates (e.g. see (T], B) and references therein). 

A graph-based clustering or a graph partitioning problem is
usually reduced to an optimization problem where the cost
function is a measure of sparsity or density related to the
corresponding classes of data. From this point of view, it
is not a surprise to see a variety of such measures in the
literature. However, from a more theoretical standpoint such
similarity measures are well studied and, at least, the most
geometrically important classes of them are characterized (e.g.
see [27] for a very general setting). In this context such
measures usually appear as norms or their normalized versions
that should be minimized or maximized to lead to the expected
answer.

What is commonly referred to as spectral clustering is the
case in which the corresponding normalized norm is expressed
as an $L_2$ (i.e. Euclidean) norm and admits a real-function
relaxation whose minimum is actually an eigenvalue of the
weight (or a related) matrix of the graph. This special case,
along with the important fact that the spectral properties
(i.e. eigenvalues and eigenfunctions) of a finite matrix can
be effectively computed (in at most $O(n^3)$ time), provides
a very interesting setting for data clustering in which the
corresponding optimization problem can be tackled using
the well-known tools of linear algebra and operator theory (e.g.
see 1(16), In), ||20)-||22), |[3l|, JSS), ||35| and references therein 
for a general background in spectral methods). 

Although spectral methods are quite effective and widely
applied in data clustering, the time complexity of the known
algorithms and the approximation factor of this approach are
not as good as one expects when one is
dealing with large data sets (e.g. see p6j that proves an ap- 
proximation factor of at most 2). On the other way round, these 
facts lead one to consider the original normalized versions of
the $L_1$ norm, which reduce clustering to the sparsest (or similar
minimal) cut problems, or their real-function relaxations as
the corresponding approximations. It is proved in [4] that the
most natural such normalized norms do not admit real-function
relaxations when they are minimized over partitions of their
domain. Moreover, it is shown in the same reference that
such normalized norms do admit such real-function relaxations
when they are minimized over subpartitions of their domain.
In this new setting the minimum values, which correspond to the
eigenvalues in the spectral $L_2$ setting, are usually referred to
as isoperimetric constants.

Unfortunately, contrary to the case of $L_2$, decision problems
corresponding to the isoperimetry problems are usually NP- 
hard (e.g. see |J5), p3], p9), p9|), which shows that comput- 
ing the exact value of the isoperimetric constants is not an easy 
task. There have been a number of contributions in the literature
whose main objective can be described as proposing different
methods to get around this hardness and to find an approximation
for the corresponding isoperimetry problem as a clustering
criterion, and consequently to obtain an approximate clustering
of the given data.

In this regard, one may note at least two different approaches.
On the one hand, there have been contributions that reduce the
problem to the more tractable case of trees by first finding a
suitable subtree of the graph and then approximately clustering
the tree itself (e.g. see
||2), lH), in), (18), (25), (32), (34)). The difference between 
such contributions usually lies in the way of choosing the
subtree and the method of clustering it. On the other hand,
one may also try to obtain a global clustering by mimicking
spectral methods, solving not an eigenfunction problem but a
similar problem in $L_1$ (e.g. see [10], [11]). These methods
usually follow an iterative 2-partitioning, since there was not
much information about approximations for higher order
eigenfunctions or similar solutions in $L_1$ until recently



(e.g. see ITTJ, |[T9|). 

Our main contribution in this article can be described as 
a culmination of above mentioned ideas that strongly rely on 
some recent studies of higher order solutions of isoperimetry 
problems (see ||4|, pi), in which we first search for a suitable 
spanning subtree and then obtain the exact solution of the
corresponding optimization problem for our suitably chosen
isoperimetric constant (see Section II). Also, in this setting we
obtain a subset of unused data, given as the complement of the
obtained clustering as a subpartition, and we analyze this extra
output of our algorithm as an outlier detection procedure (see
Section IV). To do this we adopt a generalized graph model, namely
a weighted graph with potential, and we need a generalization of
some results of [5] and [4] that is presented in Section II.
Also, in Sections III and IV we provide experimental results
to show the efficiency and the performance of our proposed
algorithms.

II. The clustering model and algorithm 

In this section we introduce our graph based model and the 
proposed clustering algorithm. 

Fix positive integers $d \ge 1$, $k$ and $n$ such that
$2 \le k \le n$, and let $X = \{x_1, \ldots, x_n\}$ be a set of $n$
vectors in $\mathbb{R}^d$. A standard $k$-clustering problem for
$X$ is to find a $k$-partition of $X$ with a high intra-cluster
similarity as well as a low inter-cluster similarity (with respect
to a predefined similarity measure).

In graph-based methods of clustering, the data-set $X$ is
represented by a weighted graph on $n$ vertices, where each
vertex corresponds to a vector in $X$ and the weight of an
edge $x_i x_j$ reflects the similarity between the vectors $x_i$
and $x_j$. In this article, our graph model consists of a simple
graph $G = (X, E)$ on the vertex set $X := \{x_1, \ldots, x_n\}$
endowed with three weight functions, namely, a vertex-weight
function $\omega : X \to \mathbb{R}^+$, an edge-weight function
$\varphi : E \to \mathbb{R}^+$ called the flow, and a function
$p : X \to \mathbb{R}$ called the potential. The function $\omega$
is used for the weight of each element of the data-set $X$, and the
similarities between pairs of elements are given by the flow
function $\varphi$. The potential $p(x_i)$ of a vertex $x_i \in X$
is used to represent the extent of isolation or alienation of $x_i$
from other elements. In this setting, the weighted graph is called
the similarity or affinity graph and is denoted by
$(G, \omega, \varphi, p)$ where, hereafter, the size of the
data-set, which is equal to the number of vertices, is fixed to be
$n := |X|$. For instance, in a classic way of modelling
similarities, one may define the flow and vertex-weight functions
as follows:

$$\forall\, 1 \le i \ne j \le n, \quad \varphi(x_i x_j) := \exp\left(-\|x_i - x_j\|_2^2 / 2\sigma^2\right),$$

$$\forall\, 1 \le i \le n, \quad \omega(x_i) := \sum_{j=1}^{n} \varphi(x_i x_j). \qquad (1)$$
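As an illustration of (1), the following minimal Python sketch computes the flows and vertex weights for a small point set. The function name `gaussian_affinity`, the dictionary layout and the choice to skip the diagonal term $j = i$ in the sum are assumptions made for this example, not part of the model above.

```python
import math

def gaussian_affinity(points, sigma):
    """Return (phi, omega) for the fully connected similarity graph of (1).

    phi maps an index pair (i, j) with i < j to the flow of the edge x_i x_j;
    omega[i] is the sum of the flows on the edges incident to x_i
    (assumption: the term j = i is left out, since the graph is simple).
    """
    n = len(points)
    phi = {}
    for i in range(n):
        for j in range(i + 1, n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            phi[(i, j)] = math.exp(-sq_dist / (2.0 * sigma ** 2))
    omega = [0.0] * n
    for (i, j), w in phi.items():
        omega[i] += w
        omega[j] += w
    return phi, omega

# Hypothetical toy data: two nearby points and one far-away point.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
phi, omega = gaussian_affinity(points, sigma=1.0)
print(phi[(0, 1)] > phi[(0, 2)])  # nearby points get the larger flow
```

Small values of `sigma` make the flow decay faster with distance, which is why the choice of the scale parameter matters in practice.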



We do not elaborate on the more or less complex subject of 
proposing suitable methods for graphical presentation of data 
sets. The interested reader is referred to the existing literature 
(e.g. p6| ) to see how the scale parameter a is chosen and how 
it affects the performance of graph-based algorithms. However,
in order to present a complete comparison in Section III, we
consider both global scaling and local scaling models in our
performance analysis.

In the sequel we will discuss the important role of potentials
in Section IV, when we elaborate on the capability of our
algorithm to detect the outlier profile of the data set.

To describe our model of clustering, we first need some notation.
For a subset $A \subset X$, the boundary of $A$, denoted by
$\partial A$, is defined as

$$\partial A := \{e = xy \in E \mid x \in A,\ y \in X \setminus A\}.$$

Also, for any given finite set $A$, a function
$f : A \to \mathbb{R}$ and a subset $B \subset A$, we define

$$f(B) := \sum_{x \in B} f(x).$$

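The collection of all $k$-partitions of the set $X$ is denoted by
$\mathcal{P}_k(X)$. Given a weighted graph $(G, \omega, \varphi, p)$
and an integer $2 \le k \le n$, (the maximum version of) the
$k$-normalized cut problem seeks a $k$-partition
$\mathcal{A} := \{A_1, \ldots, A_k\} \in \mathcal{P}_k(X)$
that minimizes the following cost function,

$$\mathrm{cost}(\mathcal{A}) := \max_{1 \le i \le k} \frac{\varphi(\partial A_i) + p(A_i)}{\omega(A_i)}. \qquad (2)$$

We define

$$\mathrm{MNC}_k(G) := \min_{\mathcal{A} \in \mathcal{P}_k(X)} \mathrm{cost}(\mathcal{A}).$$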

The quotient $(\varphi(\partial A_i) + p(A_i))/\omega(A_i)$ is
called the normalized flow of the set $A_i$. The (max) normalized
cut problem is known to be NP-hard for general graphs even when
$k = 2$ [1, 23]. In [5] the same problem is investigated for
weighted trees and it is proved that the corresponding decision
problem for arbitrary $k$ remains NP-complete even for simple
(unweighted) trees. In the same reference a tractable relaxation
of the problem is also proposed, which is based on relaxing the
feasible set from the set of $k$-partitions to the set of
$k$-subpartitions, i.e. $k$ disjoint subsets of the vertex set.
The set of all $k$-subpartitions of $X$, denoted by
$\mathcal{D}_k(X)$, is defined as follows,

$$\mathcal{D}_k(X) := \{\mathcal{A} := \{A_1, \ldots, A_k\} \mid \forall\, i,\ A_i \subset X, \text{ and } \forall\, i \ne j,\ A_i \cap A_j = \emptyset\}.$$

For a subpartition
$\mathcal{A} = \{A_1, \ldots, A_k\} \in \mathcal{D}_k(X)$, each
element in $X \setminus \bigcup_{i=1}^{k} A_i$ is called a residue
element (w.r.t. $\mathcal{A}$) and the number of residue elements
is called the residue number of $\mathcal{A}$.

The (maximum version of the) isoperimetric problem seeks a
minimizer of the cost function in (2) over the space of all
$k$-subpartitions of $X$. We denote the minimum by
$\mathrm{MISO}_k(G)$, i.e.

$$\mathrm{MISO}_k(G) := \min_{\mathcal{A} \in \mathcal{D}_k(X)} \mathrm{cost}(\mathcal{A}).$$

It is not hard to check that there exist instances where
the minimum $\mathrm{MISO}_k(G)$ occurs on a subpartition which
is not actually a partition, and it can also be verified that
$\mathrm{MNC}_2(G) = \mathrm{MISO}_2(G)$ when the potential
function is equal to zero (see [4]).
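To make the difference between the two problems concrete, the sketch below evaluates the cost function (2) by exhaustive search on a tiny graph, once over $k$-partitions and once over $k$-subpartitions. The helper names, the star-shaped instance and its integer weights are hypothetical choices for illustration; exhaustive search is only feasible for very small $n$ and is not the method proposed in this article.

```python
from fractions import Fraction
from itertools import product

def normalized_flow(part, edges, omega, p):
    """(flow across the boundary of `part` + potential of `part`) / weight of `part`."""
    boundary = sum(w for (u, v), w in edges.items() if (u in part) != (v in part))
    return Fraction(boundary + sum(p[x] for x in part), sum(omega[x] for x in part))

def cost(parts, edges, omega, p):
    """Cost (2): the largest normalized flow among the parts."""
    return max(normalized_flow(a, edges, omega, p) for a in parts)

def brute_force(k, edges, omega, p, allow_residue):
    """Exhaustive minimum of cost over k-partitions (or k-subpartitions)."""
    n = len(omega)
    labels = range(0 if allow_residue else 1, k + 1)  # label 0 marks a residue element
    best = None
    for assignment in product(labels, repeat=n):
        parts = [{x for x in range(n) if assignment[x] == i} for i in range(1, k + 1)]
        if all(parts):  # every part must be non-empty
            value = cost(parts, edges, omega, p)
            if best is None or value < best[0]:
                best = (value, parts)
    return best

# Hypothetical instance: a star with a light centre (vertex 0) and three heavy leaves.
edges = {(0, 1): 1, (0, 2): 1, (0, 3): 1}
omega = [1, 2, 2, 2]
p = [0, 0, 0, 0]

mnc_value, _ = brute_force(3, edges, omega, p, allow_residue=False)
miso_value, miso_parts = brute_force(3, edges, omega, p, allow_residue=True)
residue = set(range(4)) - set().union(*miso_parts)
print(mnc_value, miso_value, residue)  # prints: 2/3 1/2 {0}
```

On this instance the best 3-partition has to attach the centre to one leaf, while the best 3-subpartition leaves the centre out as a residue element and achieves a strictly smaller cost.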

The idea of relaxing the normalized cut problem to the
isoperimetric problem has a number of justifications from
different points of view. On the one hand, from a computational
viewpoint, the isoperimetric problem is more tractable
in some special cases (e.g. see Theorem 3), while, on the
other hand, from a theoretical viewpoint, it can be verified
that the isoperimetry problem admits a Federer-Fleming-type
theorem, whereas the normalized cut problem does not admit
such a relaxation in general (see Theorem 1 and [4]).

For two given real functions $f$ and $g$ on $X$ and a positive
weight function $\omega : X \to \mathbb{R}^+$, define the weighted
inner product as

$$\langle f, g \rangle_\omega := \sum_{x \in X} f(x)\, g(x)\, \omega(x).$$

Also, if $\mathcal{F}^+(X)$ is the set of all non-negative real
functions on $X$ and $k \ge 1$ is a positive integer,
$\mathcal{F}_k^+(X)$ stands for the set of all collections of $k$
mutually orthogonal functions in $\mathcal{F}^+(X)$, i.e.

$$\mathcal{F}_k^+(X) := \{\{f_1, \ldots, f_k\} \mid f_i \in \mathcal{F}^+(X),\ \langle f_i, f_j \rangle_\omega = 0,\ \forall\, i \ne j\}.$$

Now, one may verify that the following Federer-Fleming-type
theorem holds. We deliberately omit the proof since it
is essentially a straightforward generalization of the proof
already presented in [4] for the standard case (i.e. when the
potential function is equal to zero).

Theorem 1 fT?). 

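For every weighted graph $(G, \omega, \varphi, p)$ and integer $k$,

$$\mathrm{MISO}_k(G) = \inf_{\{f_1, \ldots, f_k\} \in \mathcal{F}_k^+(X)} \ \max_{1 \le i \le k} \ \frac{\sum_{xy \in E} \varphi(xy)\, |f_i(x) - f_i(y)| + \sum_{x \in X} p(x)\, |f_i(x)|}{\sum_{x \in X} \omega(x)\, |f_i(x)|}.$$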
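One way to read Theorem 1 is that indicator functions of the parts of a subpartition are admissible test functions, and on them the quotient reduces to the normalized flow. The sketch below checks this on a small path graph; the function names and the numeric instance are hypothetical and serve only as an illustration.

```python
from fractions import Fraction

def quotient(f, edges, omega, p):
    """Ratio appearing in Theorem 1 for a single non-negative function f."""
    num = sum(w * abs(f[u] - f[v]) for (u, v), w in edges.items())
    num += sum(p[x] * abs(f[x]) for x in range(len(f)))
    den = sum(omega[x] * abs(f[x]) for x in range(len(f)))
    return Fraction(num, den)

def inner(f, g, omega):
    """Weighted inner product <f, g>_omega."""
    return sum(f[x] * g[x] * omega[x] for x in range(len(f)))

# Hypothetical instance: the path 0 - 1 - 2 - 3 with potentials at both ends.
edges = {(0, 1): 2, (1, 2): 1, (2, 3): 2}
omega = [3, 3, 3, 3]
p = [1, 0, 0, 1]

f1 = [1, 1, 0, 0]  # indicator of A_1 = {0, 1}
f2 = [0, 0, 1, 1]  # indicator of A_2 = {2, 3}

value = max(quotient(f1, edges, omega, p), quotient(f2, edges, omega, p))
print(inner(f1, f2, omega), value)  # prints: 0 1/3
```

Each quotient equals $(1 + 1)/6$: one unit of flow on the cut edge plus one unit of potential, divided by the weight of the part. Indicators of disjoint sets are automatically orthogonal with respect to $\langle \cdot, \cdot \rangle_\omega$.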



It is shown in [5] that in the case of weighted trees, despite
the intractability of the normalized cut problem, the decision
problem related to $\mathrm{MISO}_k(G)$ is efficiently solvable in
the following sense.

Theorem 2 [5].

For every weighted tree $(T, \omega, \varphi, p)$, the decision
version of the (max) isoperimetry problem can be solved in
linear time.

Since our proposed clustering method is based on the
algorithm announced in Theorem 2, we include the algorithm
for completeness (see Algorithm 1). In what follows assume
that a vertex $v$ is selected as the root and the vertices are
ordered in a BFS order, as $x_1, \ldots, x_n = v$ (so that every
vertex appears before its parent).



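Algorithm 1 Given a weighted tree $(T, \omega, \varphi, p)$, an
integer $k$ and a rational number $N$, decide whether there exists
some $\mathcal{A} \in \mathcal{D}_k(X)$ such that
$\mathrm{cost}(\mathcal{A}) \le N$ as in (2).

  Initialize the set function $\eta : X \to \mathcal{P}(X)$ by
    $\eta(x_i) := \{x_i\}$ for each $1 \le i \le n$.
  Define $i := 1$ and $j := 0$.
  while $j < k$ and $i \le n$ do
    Let $u$ be the unique parent of $x_i$ and $e := u x_i \in E$
      (if $i = n$, then define $\varphi(e) := 0$).
    if $p(x_i) + \varphi(e) \le N \omega(x_i)$ then
      $j \leftarrow j + 1$, $A_j \leftarrow \eta(x_i)$,
      $\omega(A_j) \leftarrow \omega(x_i)$,
      $\varphi(\partial A_j) \leftarrow \varphi(e) + p(x_i)$,
      $p(u) \leftarrow p(u) + \varphi(e)$.
    else if $p(x_i) - \varphi(e) < N \omega(x_i)$ then
      $\eta(u) \leftarrow \eta(u) \cup \eta(x_i)$,
      $\omega(u) \leftarrow \omega(u) + \omega(x_i)$,
      $p(u) \leftarrow p(u) + p(x_i)$.
    else {i.e. $p(x_i) - \varphi(e) \ge N \omega(x_i)$}
      $p(u) \leftarrow p(u) + \varphi(e)$.
    end if
    $i \leftarrow i + 1$.
  end while
  if $j = k$ then
    return YES and $\{A_1, \ldots, A_k\}$
  else
    return NO
  end if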
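The leaf-to-root pass of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: the tree is encoded by a hypothetical `parent` array in which every vertex precedes its parent and the root comes last, `flow[i]` is the flow of the edge from vertex `i` to its parent, all weights are integers, and the first comparison is taken to be non-strict.

```python
from fractions import Fraction

def decide_isoperimetry(parent, flow, omega, p, k, N):
    """Return k disjoint vertex sets with normalized flow <= N, or None.

    parent[i] is the index of the parent of vertex i (None for the root);
    vertices are ordered so that children come before their parents.
    """
    n = len(parent)
    omega, p = list(omega), list(p)      # work on copies, the pass updates them
    group = [{i} for i in range(n)]      # eta(x_i): vertices merged into x_i so far
    parts = []
    for i in range(n):
        if len(parts) == k:
            break
        u = parent[i]
        phi_e = flow[i] if u is not None else 0
        if p[i] + phi_e <= N * omega[i]:
            parts.append(group[i])       # cut eta(x_i) off as a new cluster
            if u is not None:
                p[u] += phi_e            # the cut edge now counts against u
        elif p[i] - phi_e < N * omega[i]:
            group[u] |= group[i]         # merging eta(x_i) into u helps u
            omega[u] += omega[i]
            p[u] += p[i]
        elif u is not None:
            p[u] += phi_e                # eta(x_i) is dropped as residue
    return parts if len(parts) == k else None

# Hypothetical star: leaves 0, 1, 2 of weight 2 attached to a root 3 of weight 1.
parent = [3, 3, 3, None]
flow = [1, 1, 1, 0]
omega = [2, 2, 2, 1]
p = [0, 0, 0, 0]

yes = decide_isoperimetry(parent, flow, omega, p, 3, Fraction(1, 2))
no = decide_isoperimetry(parent, flow, omega, p, 3, Fraction(2, 5))
print(yes, no)  # prints: [{0}, {1}, {2}] None
```

With the threshold $N = 1/2$ each leaf qualifies as a cluster on its own and the root is left as a residue element; with $N = 2/5$ every leaf is merged into the root and only one admissible set is found, so the answer is NO.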





In this article we provide an improved version of Theorem 2
as follows.

Theorem 3.

For every weighted tree $(T, \omega, \varphi, p)$ on $n$ vertices
and every integer $2 \le k \le n$, the value of
$\mathrm{MISO}_k(T)$ and a minimizer in $\mathcal{D}_k(V)$ can be
found in time $O(n \log n)$.

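Proof: Let $T = (V, E, \omega, \varphi, p)$ be a fixed weighted
tree on $n$ vertices and $1 < k \le n$ be a fixed integer. Without
loss of generality, assume that all the weights are integers.
Define

$$\omega_* := \min_{x \in V} \omega(x), \qquad \omega^* := \sum_{x \in V} \omega(x),$$

$$\varphi_* := \min_{e \in E} \varphi(e), \qquad \varphi^* := \sum_{e \in E} \varphi(e),$$

$$p_* := \min_{x \in V} p(x) \quad \text{and} \quad p^* := \sum_{x \in V} p(x).$$

Note that for every non-empty subset $A \subset V$, the value
of $(\varphi(\partial A) + p(A))/\omega(A)$ is a rational number
within the interval
$[(\varphi_* + p_*)/\omega^*,\ (\varphi^* + p^*)/\omega_*]$.
Furthermore, for two non-empty subsets $A, B \subset V$, if
$a := (\varphi(\partial A) + p(A))/\omega(A)$ and
$b := (\varphi(\partial B) + p(B))/\omega(B)$ are distinct, then

$$|a - b| \ge \frac{1}{(\omega^*)^2}. \qquad (3)$$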
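The bound (3) follows from a short computation once the weights are integers. The derivation below is a sketch; the shorthand $P_A$, $P_B$ for the two numerators is introduced here only for illustration.

```latex
\begin{align*}
P_A &:= \varphi(\partial A) + p(A) \in \mathbb{Z}, \qquad
P_B := \varphi(\partial B) + p(B) \in \mathbb{Z}, \\
|a - b| &= \left| \frac{P_A}{\omega(A)} - \frac{P_B}{\omega(B)} \right|
         = \frac{|P_A\,\omega(B) - P_B\,\omega(A)|}{\omega(A)\,\omega(B)}
         \ge \frac{1}{\omega(A)\,\omega(B)}
         \ge \frac{1}{(\omega^*)^2}.
\end{align*}
```

The first inequality holds because the numerator is a non-zero integer when $a \ne b$, and the second because $\omega(A)$ and $\omega(B)$ are both at most the total weight $\omega^*$.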



Based on Algorithm 1, Algorithm 2, described below, finds a
minimizer for $\mathrm{MISO}_k(T)$.



Fig. 1. An outline of the main algorithm. [Flowchart stages: Input:
data-set $X$ and integer $k > 1$; Pre-process: generate the affinity
graph $G$; Find (approx.): a minimum spanning tree $T$; Compute
(exact): $\mathrm{MISO}_k(T)$ and a minimizing subpartition;
Post-process (approx.): reduce the residue number.]



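Algorithm 2 Given a weighted tree $(T, \omega, \varphi, p)$ and an
integer $k$, find a minimizer achieving $\mathrm{MISO}_k(T)$.

  Let $\alpha_0 \leftarrow (\varphi_* + p_*)/\omega^*$ and
    $\beta_0 \leftarrow (\varphi^* + p^*)/\omega_*$.
  Let $t \leftarrow \log(2 (\omega^*)^2 (\beta_0 - \alpha_0)) - \log(\varphi_* + p_*)$.
  Initialize $\alpha \leftarrow \alpha_0$ and $\beta \leftarrow \beta_0$.
  for $i = 1$ to $t$ do
    Applying Algorithm 1, decide if $\mathrm{MISO}_k(T) \le (\alpha + \beta)/2$.
    if $\mathrm{MISO}_k(T) \le (\alpha + \beta)/2$ then
      $\beta \leftarrow (\alpha + \beta)/2$.
    else
      $\alpha \leftarrow (\alpha + \beta)/2$.
    end if
  end for
  Let $\mathcal{A}$ be the $k$-subpartition output of Algorithm 1 for
    deciding $\mathrm{MISO}_k(T) \le \beta$.
  return $\mathrm{MISO}_k(T) = \mathrm{cost}(\mathcal{A})$ and $\mathcal{A}$.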



To prove the correctness of Algorithm 2, note that after
the for loop we obtain an interval $[\alpha, \beta]$, containing
the rational numbers $\mathrm{MISO}_k(T)$ and
$\mathrm{cost}(\mathcal{A})$, whose length is equal to

$$\frac{\beta_0 - \alpha_0}{2^t} = \frac{\varphi_* + p_*}{2 (\omega^*)^2},$$

and consequently, by (3), $\mathrm{cost}(\mathcal{A}) = \mathrm{MISO}_k(T)$.

Finally, the runtime of this algorithm is verified to be in

$$O(nt) = O\big(n\, (\log(2 (\omega^*)^2 (\beta_0 - \alpha_0)) - \log(\varphi_* + p_*))\big) = O(n \log n).$$
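The bisection behind Algorithm 2 can be sketched independently of the tree procedure by treating the decision step as a black box. In the Python sketch below, `decide` is a hypothetical oracle that returns a witness whenever some admissible cost is at most the threshold and `None` otherwise. The loop runs until the interval is shorter than the separation gap from (3) instead of using a precomputed iteration count; that stopping rule is a simplification assumed for this example.

```python
from fractions import Fraction

def bisect_minimum(decide, lo, hi, gap):
    """Shrink [lo, hi] until it is shorter than gap; hi always stays feasible.

    Assumes decide(hi) succeeds and that distinct attainable costs differ by
    at least gap, so the last witness found attains the exact minimum.
    """
    lo, hi = Fraction(lo), Fraction(hi)
    witness = decide(hi)
    steps = 0
    while hi - lo >= gap:
        mid = (lo + hi) / 2
        found = decide(mid)
        if found is not None:
            hi, witness = mid, found     # the minimum is at most mid
        else:
            lo = mid                     # the minimum is larger than mid
        steps += 1
    return hi, witness, steps

# Hypothetical oracle: the attainable costs are known and the witness is the best one.
candidates = [Fraction(1, 2), Fraction(2, 3), Fraction(3)]

def oracle(threshold):
    best = min(candidates)
    return best if best <= threshold else None

# Bounds and gap for a total vertex weight of 7: [1/7, 3] and 1/49.
upper, witness, steps = bisect_minimum(oracle, Fraction(1, 7), 3, Fraction(1, 49))
print(witness, steps)  # prints: 1/2 8
```

Each call to the oracle halves the interval, so the number of oracle calls grows only logarithmically in the ratio between the initial interval length and the gap.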



Based on these facts, let us describe the main parts of our
proposed clustering algorithm as follows. (The outline of the
algorithm is depicted in Figure 1.)

1) Given the data-set of vectors $X$, construct the affinity
graph $G$ on $X$ along with the weights
$d(x_i, x_j) := \|x_i - x_j\|_2$.

2) Find a minimum spanning tree $T$ of $(G, d)$ and construct
a weighted tree $(T, \omega, \varphi)$ using the similarity
weights as in (1).

3) Apply Algorithm 2 to find $\mathrm{MISO}_k(T)$ along with a
minimizing subpartition $\mathcal{A} \in \mathcal{D}_k(X)$.

4) Use a post-processing algorithm (Algorithm 3) to reduce
the residue number of $\mathcal{A}$ and output the optimized
clustering $\mathcal{A}^*$.
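Steps 1 and 2 of the outline above can be sketched with a textbook $O(n^2)$ version of Prim's algorithm on the complete Euclidean graph. The function name `prim_mst`, the toy points and the value of `sigma` are assumptions for illustration; the article does not prescribe a particular minimum spanning tree algorithm.

```python
import math

def prim_mst(points):
    """Return the edges (u, v) of a Euclidean minimum spanning tree, grown from vertex 0."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n        # cheapest known distance from the tree to each vertex
    link = [None] * n            # tree vertex realising that distance
    best[0] = 0.0
    tree_edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if link[u] is not None:
            tree_edges.append((link[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], link[v] = d, u
    return tree_edges

# Hypothetical data: three points close together and one far away.
points = [(0.0, 0.0), (1.0, 0.0), (1.1, 0.0), (5.0, 0.0)]
tree_edges = prim_mst(points)

# Attach the Gaussian flows of (1) to the tree edges (sigma chosen arbitrarily).
sigma = 1.0
flows = {(u, v): math.exp(-math.dist(points[u], points[v]) ** 2 / (2 * sigma ** 2))
         for (u, v) in tree_edges}
print(tree_edges)  # prints: [(0, 1), (1, 2), (2, 3)]
```

The long edge towards the isolated point receives a very small flow, which is exactly the kind of edge the isoperimetric criterion prefers to cut.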

Fig. 2. A simple 2-clustering problem and the associated tree and
break edge.

The rest of this section is devoted to the post-processing
algorithm that tries to reduce the residue number of the
subpartition obtained as the output of Algorithm 2. For this
it is natural to consider the following decision problem.

MINIMUM RESIDUE NUMBER

INSTANCE: A weighted tree $T = (V, E, \omega, \varphi)$ (without
potentials) and two integers $k > 1$, $N \ge 0$.

QUERY: Does there exist a minimizing subpartition
$\mathcal{A} \in \mathcal{D}_k(V)$ achieving $\mathrm{MISO}_k(T)$
whose residue number is at most $N$?

Unfortunately, the following proposition shows that one may
only hope for an approximation of the above problem, since the
decision problem is actually NP-complete.

Proposition 4.

The decision problem MINIMUM RESIDUE NUMBER is
NP-complete in the strong sense for weighted trees.

Proof: See the Appendix for a proof. ■

In what follows we propose an approximation scheme for
the above problem which will constitute the post-process
part of the main algorithm (a schematic general case of this
procedure is depicted in Figure 3, which can be helpful in
following the details below).

In order to ensure a high intra-cluster similarity we try 
to force the induced subgraph on each cluster to form a 
connected subgraph. Therefore, our basic strategy is to look 
for a minimizing subpartition with a small residue number 
whose parts induce connected subgraphs. 

To find such a subpartition, with an initial good subpartition 
in hand, we follow the following post-processing procedure 
that checks the two following facts, 

1) Each part of the subpartition induces connected sub- 
graphs. 

2) No subset of residue elements can be added to any part 
to make a better subpartition with the same connectivity 
property. 

For this, assume that
$\mathcal{A} = \{A_1, \ldots, A_k\} \in \mathcal{D}_k(X)$ is the
minimizing subpartition obtained from Algorithm 2. Also,
in order to keep the pseudo-code concise we introduce the
following terms.

1) Non-residue vertex: A vertex in $\bigcup_{i=1}^{k} A_i$.

2) Break edge: An edge in $E(A_i, A_i^c)$ for some $i$.

3) Residue subtree: A subtree, obtained by removing all
break edges from the original tree, all of whose vertices are
residue elements.

4) Start vertex of a residue subtree: A residue element
in the residue subtree which is one end of a break edge.

Figure 2 shows a typical pattern of a break edge and the
corresponding tree for a simple 2-clustering problem using
the above algorithm.

Fig. 3. A typical scheme of the post-process subroutine.

The post-process algorithm can be summarized as follows
(see Figure 3).

1) Compute all residue subtrees.

2) Contract each $A_i$ with the normalized flow $f_i$ to a single
vertex $a_i$ and set $a_t := a_i$ which has the maximum $f_i$.

3) Let $C$ be a non-flagged residue subtree connected to $a_t$
with start vertex $s$ and break edge $e$.

4) Contract edge $e$ to obtain a new residue subtree $C'$ and
start vertex $s'$ and update the weights.

5) Run Algorithm 3 on $C'$ with the root $s'$ and
$N := \mathrm{MISO}_k(T)$ and let $A'_t$ be the output.

6) If $A'_t$ is empty, flag $C$ and goto Step 3; otherwise:

a) replace $A_t$ by $A'_t$.

b) update the residue subtree $C$.

c) clear the flags of all residue subtrees.

d) goto Step 2.

In order to prove that the procedure performs correctly,
we should prove that searching inside each residue subtree is
sufficient to ensure the properties mentioned before. For this,
assume that $\mathcal{A} = \{A_1, \ldots, A_k\} \in \mathcal{D}_k(X)$
is a minimizing subpartition. Let $C_1$ and $C_2$ be two residue
subtrees and for each $i = 1, 2$, let $S_i \subset C_i$ be a subset
connected to $A_1$. Then we have

$$\frac{\varphi(\partial A_1) + p(A_1)}{\omega(A_1)} \le \mathrm{MISO}_k(T). \qquad (4)$$

Now, if

$$\frac{\varphi(\partial(A_1 \cup S_i)) + p(A_1 \cup S_i)}{\omega(A_1 \cup S_i)} > \mathrm{MISO}_k(T), \qquad (5)$$

then from (4) and (5), we conclude that

$$\frac{\varphi(\partial(A_1 \cup S_1 \cup S_2)) + p(A_1 \cup S_1 \cup S_2)}{\omega(A_1 \cup S_1 \cup S_2)} > \mathrm{MISO}_k(T).$$

This clearly shows that searching inside each residue subtree
for a good subset is sufficient to find all good subsets, and
hence, the post-process algorithm performs correctly.

Algorithm 3 Post-process

  Input a subtree $C$ with the root $s$ and rational number $N$.
  Order the vertices of $C$ in BFS order as $x_1, x_2, \ldots, x_t = s$.
  Set $i = 1$ and initialize the set function
    $\eta : V(C) \to \mathcal{P}(V(C))$ by $\eta(x_i) := \{x_i\}$.
  for $i = 1$ to $t - 1$ do
    $u = \mathrm{parent}(x_i)$.
    $e = \{x_i, u\}$.
    if $p(x_i) - \varphi(e) < N \omega(x_i)$ then
      $p(u) \leftarrow p(u) + p(x_i)$.
      $\omega(u) \leftarrow \omega(u) + \omega(x_i)$.
      $\eta(u) \leftarrow \eta(u) \cup \eta(x_i)$.
    else
      $p(u) \leftarrow p(u) + \varphi(e)$.
    end if
  end for
  return $\eta(s)$
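The merging pass of Algorithm 3 can be sketched in Python as follows. The encoding of the residue subtree (a hypothetical `parent` array with children listed before parents and the root last, and `flow[i]` for the edge from vertex `i` to its parent) and the toy numbers are assumptions for this example; the loop stops before the root because the root has no parent edge.

```python
from fractions import Fraction

def grow_from_root(parent, flow, omega, p, N):
    """Return the set of vertices merged into the root by the leaf-to-root pass."""
    n = len(parent)
    omega, p = list(omega), list(p)      # work on copies, the pass updates them
    group = [{i} for i in range(n)]      # eta(x_i)
    for i in range(n - 1):               # the root (last vertex) is not processed
        u = parent[i]
        if p[i] - flow[i] < N * omega[i]:
            p[u] += p[i]                 # absorb eta(x_i) into the parent
            omega[u] += omega[i]
            group[u] |= group[i]
        else:
            p[u] += flow[i]              # keep x_i out; its edge becomes boundary
    return group[n - 1]

# Hypothetical residue path 0 - 1 - 2 rooted at 2 (the contracted start vertex).
parent = [1, 2, None]
flow = [1, 4, 0]
omega = [1, 4, 10]

kept_out = grow_from_root(parent, flow, omega, [3, 0, 2], Fraction(1, 2))
absorbed = grow_from_root(parent, flow, omega, [0, 0, 2], Fraction(1, 2))
print(kept_out, absorbed)  # prints: {1, 2} {0, 1, 2}
```

In the first call vertex 0 carries a large potential, so absorbing it would raise the normalized flow above the threshold and it stays a residue element; in the second call it has no potential and is absorbed together with vertex 1.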

III. Analysis and Experimental Comparison 

In this section we go through the time complexity and 
performance analysis of our proposed algorithm. 

A. Time complexity analysis 

Based on the details presented in the previous section the 
algorithm consists of three phases: 

• PHASE I: A pre-processing phase where an affinity 
graph is constructed from the input data and a minimum 
spanning tree of the graph is obtained. 

• PHASE II: A tree-partitioning phase where the minimum
spanning tree is sub-partitioned according to an
isoperimetry criterion (Algorithm 2).

• PHASE III: A post-processing phase where residue subtrees
are reprocessed in order to find a minimizing
subpartition with the minimal residue number.

In what follows we elaborate on estimating the time complexity
of each phase. We denote the time complexity function of
the algorithm by $t(n)$, where $n$ is the size of the data set
and

$$t(n) = t_1(n) + t_2(n) + t_3(n), \qquad (6)$$



TABLE I
Performance on UCI database (global scaling, $\sigma = 0.09$):
NJW=Ng-Jordan-Weiss [24], LT=Li-Tian [19], GS=Grady-Schwartz [11],
SM=Shi-Malik [29], WJHZQ=Wang et al. [32], DJS=this paper.

Data set      Size  Cluster No.  Dim.  NJW       LT        GS        SM        WJHZQ     DJS
Wine          178   3            13    0.331461  0.286517  0.471910  0.297753  0.325843  0.280899
Iris          150   3            4     0.100000  0.066667  0.333333  0.100000  0.040000  0.040000
Breast        106   6            9     0.632075  0.594340  0.603774  0.698113  0.471698  0.500000
Segmentation  210   7            19    0.561905  0.476190  0.566667  0.371429  0.395238  0.409524
Glass         214   6            10    0.556075  0.457944  0.588785  0.467290  0.658879  0.528037
Average                                0.4363    0.3763    0.5129    0.3869    0.3783    0.3573



TABLE II
Performance on UCI database (local scaling, $\nu = 30$):
NJW=Ng-Jordan-Weiss [24], LT=Li-Tian [19], GS=Grady-Schwartz [11],
SM=Shi-Malik [29], WJHZQ=Wang et al. [32], DJS=this paper.

Data set      Size  Cluster No.  Dim.  NJW       LT        GS        SM        WJHZQ     DJS
Wine          178   3            13    0.280899  0.280899  0.280899  0.280899  0.308989  0.280899
Iris          150   3            4     0.086667  0.073333  0.333333  0.100000  0.040000  0.040000
Breast        106   6            9     0.622642  0.622642  0.622642  0.641509  0.471698  0.509434
Segmentation  210   7            19    0.404762  0.504762  0.609524  0.404762  0.347619  0.423810
Glass         214   6            10    0.560748  0.495327  0.504673  0.644860  0.658879  0.560748
Average                                0.3911    0.3954    0.4702    0.4144    0.3654    0.3574



in which $t_i(n)$ denotes the time complexity of the $i$-th phase.
The following analysis will show that $t(n) \in O(n^2)$ in
the worst case for the global scaling scenario with
post-processing, whereas in the local scaling setting it is
observed that the basic algorithm operates in $O(n \log n)$ time.

1) PHASE I: It is supposed that the $i$-th object of interest
is given in a vector representation $x_i \in \mathbb{R}^d$, where
each element of the feature vector is a real number ($d$ is a fixed
integer). Also, it is presumed that a similarity function
$S : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is given with
computational complexity $O(t_s(d)) = O(c)$ for a constant $c$.

In the construction of a global scale affinity graph one needs
$\binom{n}{2} \in O(n^2)$ computations of the similarity function.
On the other hand, in a local scale affinity graph (e.g. see
[?6]) it is enough to focus on a fixed proximity of each vertex
(as is the case, e.g., in image segmentation applications) and
consequently, one may use any technique for the nearest neighbour
search problem, such as space partitioning, locality sensitive
hashing, approximate nearest neighbour or other well-known
methods, to obtain the affinity graph of the input vectors in
sub-polynomial time. Assuming that the input vectors belong
to a space with a Minkowski metric (which is usually the
case in real applications), one may use $\epsilon$-approximate nearest
neighbour search to find a fixed number $\nu$ of neighbours for each
object. This results in a graph of size $|E| \in O(n)$ which is
constructed in time $O(n \log n)$ for a fixed $\epsilon$.

Given a graph of size $|E| \in O(n)$ one may easily find a
minimum spanning tree using well-known algorithms in time
at most $O(n \log n)$.

Hence, in a local scaling model we may assume that $t_1(n) \in O(n \log n)$,
and in the global scaling model we have $t_1(n) \in O(n^2)$
in the worst case.
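The bound above can be illustrated with Kruskal's algorithm, which is one of the well-known choices: sorting the edge list dominates the running time, so a graph with $|E| \in O(n)$ is handled in $O(n \log n)$. The following is only a minimal sketch; the function name `minimum_spanning_tree` and the `(cost, u, v)` edge format are assumptions made for this example and are not taken from the authors' implementation.

```python
def minimum_spanning_tree(n, edges):
    """Kruskal's algorithm on vertices 0..n-1.

    edges: iterable of (cost, u, v) triples (hypothetical input format).
    Returns the list of edges of a minimum spanning forest.
    """
    parent = list(range(n))

    def find(x):
        # Walk up to the component representative, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for cost, u, v in sorted(edges):  # the sort dominates: O(|E| log |E|)
        ru, rv = find(u), find(v)
        if ru != rv:                  # the edge joins two different components
            parent[ru] = rv
            tree.append((cost, u, v))
    return tree


example_edges = [(1.0, 0, 1), (2.0, 1, 2), (3.0, 0, 2), (4.0, 2, 3)]
example_tree = minimum_spanning_tree(4, example_edges)
```

In the example the edge `(3.0, 0, 2)` is skipped because vertices 0 and 2 are already connected when it is examined.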

2) PHASE II: The time complexity of Algorithm 1 was shown
to be linear in $|E|$. We verified within the proof of Theorem 3
that the runtime of Algorithm 2 lies in $O(n \log n)$. Thus, with
the notations in the proof of Theorem 3 we have

$$t_2(n) \in O\big(n\,(\log(2\omega^*(\beta_0 - \alpha_0)) - \log(\varphi_* + p_*))\big) = O(n \log n).$$

3) PHASE III: By definition, the worst-case time complexity
of the post-processing algorithm is bounded by $O(n^2)$,
since one may think of a case where all connected components
of the subgraph are of order one. However, it should be noted
that in our experiments the generic cases were observed
to be far from the worst case.



B. Experimental results 

In this section we provide an experimental comparison of the
proposed algorithm with some similar algorithms on some
well-known clustering datasets. In our experiments the edge
weights are assumed to be equal to



S{u,v) 
(p(uv) = exp( ) 



(7) 



following the convention in the clustering literature, where $\sigma$
is the scaling parameter. Also, hereafter, $\nu$ stands for the
neighbourhood parameter for local scaling.

1) UCI benchmarks: Table II reports the outcome of the
performance analysis of the compared algorithms on data sets from
the UCI machine-learning benchmark repository [30]. Each number in the table
represents the misclassification rate (i.e. the ratio of incorrect
labellings to the total number of objects) for the corresponding
algorithm. For the algorithms that do not get the number of
clusters as a part of the input, a precision threshold is set and
the algorithm is terminated after that stage. We should also
report that the GS algorithm terminated on the Iris data
set with a 2-clustering, and this is the main reason for this
algorithm's relatively high misclassification rate in this case.
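As an illustration of the measure used in the tables, the sketch below computes a misclassification rate by trying every assignment of cluster labels to class labels and keeping the best one. The matching step is an assumption made for this example (the text only defines the rate as the ratio of incorrect labellings), and the function name `misclassification_rate` is hypothetical.

```python
from itertools import permutations


def misclassification_rate(true_labels, cluster_labels):
    """Fraction of wrongly labelled objects under the best cluster-to-class map.

    Assumes there are at most as many clusters as classes and that the
    number of classes is small enough for a brute-force search.
    """
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_labels))
    if len(clusters) > len(classes):
        raise ValueError("more clusters than classes")
    n = len(true_labels)
    best_correct = 0
    # Each cluster is mapped to a distinct class; keep the best mapping.
    for assigned in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, assigned))
        correct = sum(mapping[c] == t
                      for t, c in zip(true_labels, cluster_labels))
        best_correct = max(best_correct, correct)
    return (n - best_correct) / n


# Three classes of 50 objects each, clustered into only two groups.
truth = [0] * 50 + [1] * 50 + [2] * 50
two_clusters = [0] * 50 + [1] * 100
rate = misclassification_rate(truth, two_clusters)
```

Under this matching, a 2-clustering that merges two of three equally sized classes gives a rate of 1/3, which is consistent with the GS entry for Iris in the tables.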



2) Some hard instances: We have also considered the
performance of our algorithm on a couple of hard artificial
clustering problems, as depicted in Figure 4. As is clear
from the results, the algorithm has been successful in
extracting the expected clusters.



Fig. 4. The first column contains a set of hard problems created artificially
and the second column consists of four cases chosen from
http://www.vision.caltech.edu/lihi/Demos/SelfTuningClustering.html where the local scaling
parameter is set to ν = 7 (σ = 0.1 for all problems).
IV. The outlier profile of a data-set

In this section we try to show how our new extended model
can be used to formalize the concept of an outlier set. First
of all, it should be noted that the concept of an outlier is
not easy to formalize since it certainly depends on the data
scaling factors. Although there has been very little work on the
formalization of this concept (e.g. see [JT] as an exceptionally
nice reference), we believe that one should try to define the
concept of an outlier profile of a data set (to be made precise
later) rather than trying to define the concept of an outlier set,
since depending on one's precision about the concept of being
far, one may come to very different conclusions about what
an outlier is. Hence, in what follows we use the flexibility
of potentials, already introduced in our model, to simulate
this precision analysis, and consequently we will be able to
define the concept of an outlier set with respect to a precision
parameter $\alpha$. Using this parametrization one comes
to a varying family of sets that will be called the outlier
profile of the data set. Using this we will also try to extract
the best candidate for an outlier set: we will propose a
criterion for extracting such a set and we will also provide
an approximation algorithm to approach a solution to
the problem. We also provide experimental results that can
serve as evidence for the correctness and applicability of our
approach.

We would like to note that this section is just a starter to 
these ideas and we believe there is a lot more that should be 
investigated theoretically and experimentally. 

Definition 5. 

Let $G = (X, E, \omega, \varphi, p)$ be a weighted graph with potential.
Then, given $k \in \mathbb{N}$ and $\alpha \in \mathbb{R}^+$, for any $\mathcal{A} \in \mathcal{D}_k(X)$ define

costfc,a(yl) := max — , (8) 



and

$$\mathrm{MISO}_{k,\alpha}(G) := \min_{\mathcal{A} \in \mathcal{D}_k(X)} \mathrm{cost}_{k,\alpha}(\mathcal{A}).$$



A subset $A \subset V$ is said to be a $(k, \alpha)$-outlier of $G$ if for
any $\beta > \alpha$, the subset $A$ does not intersect any minimizer
of $\mathrm{MISO}_{k,\beta}(G)$. It is clear by definition that any subset of a
$(k, \alpha)$-outlier of $G$ is also a $(k, \alpha)$-outlier of $G$. Hence, we
define the $(k, \alpha)$-outlier set of $G$, $O_{k,\alpha}(G)$, to be the union of
all $(k, \alpha)$-outliers of $G$, which is itself a maximal $(k, \alpha)$-outlier
of $G$.

Note that the concept of a $(k, \alpha)$-outlier also depends on
the potential function $p$. Now, as a direct consequence
of the definition we have the following.
Proposition 6.

Given positive real numbers $\alpha < \beta$ and a weighted graph
$G = (X, E, \omega, \varphi, p)$, then $O_{k,\alpha}(G) \subseteq O_{k,\beta}(G)$.

In order to match the definition with our intuition about 
outlier sets we should relate the potential function to the 
distance (i.e. the inverse of the similarity) function, and for this 
we adopt the special potential function for which the potential 
at each vertex is the mean of the distance of the vertex to the 
rest of the vertices. 



$$p(x) := \frac{1}{|X|} \sum_{y \in X} \|x - y\|.$$
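The potential described above (the mean distance from a vertex to the other data points) can be sketched as follows. The Euclidean norm, the normalization by the number of points, and the name `mean_distance_potential` are assumptions made for this example.

```python
import math


def mean_distance_potential(points):
    """Return, for each point, its mean Euclidean distance to all points.

    Assumption: the sum runs over the whole data set (the zero distance of a
    point to itself is included) and is divided by the number of points.
    """
    n = len(points)
    return [sum(math.dist(x, y) for y in points) / n for x in points]


# Two nearby points and one point far away from both.
sample = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0)]
potentials = mean_distance_potential(sample)
```

With this choice the isolated point receives the largest potential, which matches the intuition that points far from the rest of the data are the first candidates for the residue as $\alpha$ grows.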



To get a feeling for how this definition works one may
refer to the artificial example depicted in Figure 6, in which
one can see that the residue of the minimizing subpartition
increases as a consequence of an increase in the potential
function (i.e. an increase of $\alpha$).

Now, the main problem is how one may approximately
compute the outlier profile or a best candidate for an outlier
set based on the above definition. For this, consider
Figure 5, which shows how the residue number increases
as $\alpha$ increases for the graph of Figure 6. Hereafter,
this function is called the outlier profile of the graph.



Fig. 5. An approximation of the outlier profile of the graph depicted in Figure 6.

Fig. 6. The increase of residues as α increases (residue vertices are depicted
in black for α = 0, 0.6, 0.8, 2, 5, 50).
Now, our main problem is to extract the best candidate for
an outlier set. For this, one may note that a natural
criterion is a relatively long stay at a constant value
in the outlier profile which is also close to the origin. Clearly,
the two intuitive conditions, namely being close to
the origin (i.e. small $\alpha$) and having a long stay at a constant
value, compete with each other, and to set a criterion
one needs a scaling factor. Hence we let the parameter $\sigma_s$
be the desired scaling factor and define



$$s_{\sigma_s}(I_i) := \exp\Big(\frac{-\min(I_i)}{\sigma_s}\Big) - \exp\Big(\frac{-\max(I_i)}{\sigma_s}\Big), \qquad (9)$$



in which $I_i = [\min(I_i), \max(I_i)]$ is the $i$-th interval on which
the number of residues remains constant. Hence, we let the
interval $I^*$ be the interval on which $s_{\sigma_s}(I_i)$ is maximized and
choose this interval as the space in which the best $\alpha$ lives.
Following this idea we define

$$\alpha^* := \min(I^*). \qquad (10)$$
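The selection rule can be sketched in a few lines, given the intervals of the outlier profile on which the residue number is constant. The exact form of the score is an assumption here: the sketch uses a difference of two decaying exponentials with both exponents divided by the scaling factor, and the names `interval_score` and `best_alpha` are hypothetical.

```python
import math


def interval_score(lo, hi, sigma_s):
    # Assumed score: large for long intervals that start close to the origin.
    return math.exp(-lo / sigma_s) - math.exp(-hi / sigma_s)


def best_alpha(intervals, sigma_s):
    """Return (alpha_star, best_interval) for a list of (lo, hi) intervals."""
    best = max(intervals, key=lambda iv: interval_score(iv[0], iv[1], sigma_s))
    return best[0], best


# Hypothetical profile: three plateaus of the residue number.
plateaus = [(0.0, 0.2), (0.2, 1.5), (1.5, 10.0)]
alpha_star, chosen = best_alpha(plateaus, sigma_s=0.5)
```

In this example the middle plateau wins: the first one is too short, and the last one, although long, starts too far from the origin for the chosen scaling factor.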



Since it is a hard problem to extract the whole outlier
profile of a graph, we use our clustering algorithm to find
an approximate minimizer for each $\alpha$ and we also apply a
binary search on the spectrum of $\alpha$ to extract $I^*$ and $\alpha^*$ up
to a predefined precision. Such an approximation algorithm is
described in Algorithm 4. It is important to note that, based
on what follows, we can deduce that the performance of our
algorithm is acceptable despite the approximation; however, the
effect of the approximation on the final result is undeniable
(e.g. see Figure 9). Also, it should be noted that making the
precision factors finer (e.g. smaller steps in the binary search)
gives rise to a longer runtime, which is undesirable.

In order to evaluate the performance of the proposed
algorithm we consider a couple of artificial problems. At first we
applied the algorithm to the artificial clustering problems in
Figure 7. As is clear from the results, the algorithm has been
able to correctly extract the clusters as well as the outlier set.

In our final experimental evaluation we focus on the
performance of the algorithm of Wang et al. [32], since it has the
best performance according to our analysis in Section III and
also since, using a proper local scaling parameter, it is capable
of extracting the outlier set as a cluster. In this direction we
first consider a hard artificial clustering problem with outliers
Algorithm 4 HeuristicSearch

Require: BPs(1 ... n − k) = [∞, −∞], a < b, n_a < n_b
Input: BPs, [a, b], [n_a, n_b], T, P, ε
if b − a < ε then
    return BPs
else
    subpartition = {the output of Algorithm 2}
    resno = ResidueNumber(subpartition)
    if resno < n_a then
        Update all BPs from index resno + 1 to n_a
        BPs = HeuristicSearch(BPs, [α, b], [resno, n_b], T, P, ε)
    else if resno > n_b then
        Update all BPs from index n_b + 1 to resno
        BPs = HeuristicSearch(BPs, [a, α], [n_a, n_b], T, P, ε)
    else
        Update BPs with index resno
        if resno > n_a then
            BPs = HeuristicSearch(BPs, [a, α], [n_a, resno], T, P, ε)
        end if
        if resno < n_b then
            BPs = HeuristicSearch(BPs, [α, b], [resno, n_b], T, P, ε)
        end if
    end if
end if
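A simplified sketch of the binary search behind Algorithm 4 is given below. It is not the authors' procedure: it assumes a hypothetical callable `residue_number(alpha)` that stands in for running Algorithm 2 and counting the residues, it assumes that the probe point is the midpoint of the current interval, and it omits the bulk updates of the first two branches. The names `profile_breakpoints`, `record` and `search` are illustrative.

```python
import math


def profile_breakpoints(residue_number, a, b, eps=1e-3):
    """Approximate the intervals of alpha on which the residue number is constant.

    Returns a dict mapping each observed residue number r to the pair
    (smallest alpha, largest alpha) at which r was observed.
    """
    bps = {}

    def record(alpha, r):
        lo, hi = bps.get(r, (math.inf, -math.inf))
        bps[r] = (min(lo, alpha), max(hi, alpha))

    def search(a, b, n_a, n_b):
        # Stop when the interval is short or both ends agree on the residue number.
        if b - a < eps or n_a == n_b:
            return
        alpha = (a + b) / 2.0          # assumption: probe the midpoint
        r = residue_number(alpha)
        record(alpha, r)
        if r > n_a:                    # a jump lies in the left half
            search(a, alpha, n_a, r)
        if r < n_b:                    # a jump lies in the right half
            search(alpha, b, r, n_b)

    n_a, n_b = residue_number(a), residue_number(b)
    record(a, n_a)
    record(b, n_b)
    search(a, b, n_a, n_b)
    return bps


# Hypothetical monotone profile with jumps at alpha = 1 and alpha = 4.
def toy_profile(alpha):
    return 0 if alpha < 1 else (3 if alpha < 4 else 5)


found = profile_breakpoints(toy_profile, 0.0, 8.0, eps=1e-3)
```

Each halving costs one call to the oracle, so a finer precision `eps` directly increases the number of clustering runs, which is the runtime trade-off mentioned in the text.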



Fig. 7. Performance of Algorithm 4 (black spots are extracted outlier data
points and σ_s = 0.1 for all problems).



as depicted in Figure 8. As is clear from the results, our
algorithm has been able to successfully extract the outliers.

On the other hand, in order to test the impact of the local
scaling on the performance, we also consider a hard artificial
clustering problem as depicted in Figure 9, and as is clear,
there are scaling parameters for which we can successfully
extract the outlier set. In this regard it should be noted that
in the local scaling setting, for each individual case one can
always find scaling parameters that work well. However, in
our approach one may use the algorithm with as large a local
scaling parameter as possible (considering one's runtime limits)
without being concerned about the effect of this parameter
on the performance of outlier extraction, while in the rest of
the algorithms it is essentially this parameter which
tunes the algorithm to extract the outlier set as a cluster, which
is not consistent when one is dealing with a variety of different
data sets.



Appendix 
Proof of Proposition 4

Consider the following problem, which is well known to be
NP-complete in the strong sense [9].

3-PARTITION

INSTANCE: A positive integer $B$ and $3m$ positive integers
$w_1, \ldots, w_{3m}$ such that $B/4 < w_i < B/2$ for
each $1 \le i \le 3m$ and $\sum_{i=1}^{3m} w_i = mB$.
QUERY: Is there an $m$-partition $\{S_j\}_{j=1}^{m} \in \mathcal{P}_m([3m])$
such that, for each $1 \le j \le m$, $\sum_{i \in S_j} w_i = B$?



We provide a reduction from the 3-PARTITION problem.
Assume that the integers $w_1, \ldots, w_{3m}$ together with the
integer $B$ form an instance of the problem 3-PARTITION. Let $t$
be a fixed positive integer and construct the weighted tree
$T = (V, E, \omega, \varphi)$ as follows.

$V := \{x, x_i, y_j, z_l \mid 1 \le i \le 3m,\ 1 \le j \le m,\ 1 \le l \le t\}$,
$E := \{x x_i, x y_j, x z_l \mid 1 \le i \le 3m,\ 1 \le j \le m,\ 1 \le l \le t\}$.

Vertex weights are defined as follows.

$\omega(x) := 1$, $\omega(x_i) := w_i + B + 1$, $\forall\ 1 \le i \le 3m$,
$\omega(y_j) := 1$, $\forall\ 1 \le j \le m$,
$\omega(z_l) := B + 1$, $\forall\ 1 \le l \le t$.

All edge weights are set equal to 1. Let $k = m + t$
and $N = 1$ to get an instance of the problem MINIMUM
RESIDUE NUMBER. We assume that $t$ is sufficiently
large (e.g. $t > 7m$). We are going to show that
$\mathrm{MISO}_k(T) = 1/(B + 1)$. First note that for the subpartition
$\mathcal{A} := \{\{x_1\}, \ldots, \{x_{3m}\}, \{z_1\}, \ldots, \{z_t\}\}$ we have $\mathrm{cost}(\mathcal{A}) =
1/(B + 1)$ and thus $\mathrm{MISO}_k(T) \le 1/(B + 1)$. On the other
hand, let $\mathcal{B} := \{B_1, \ldots, B_k\}$ be a minimizing subpartition
achieving $\mathrm{MISO}_k(T)$. Then there is some $B_i$ which is
completely included in the set $\{z_1, \ldots, z_t\}$. Thus $\mathrm{cost}(\mathcal{B}) \ge
1/(B + 1)$. This shows that $\mathrm{MISO}_k(T) = 1/(B + 1)$.

Fig. 8. A problem with global affinity σ = 0.09 and σ_s = 0.5. First row
contains the outcome of the WJHZQ [32] algorithm. Second row contains the
outcome of our algorithm with α* = 0.25.

Fig. 9. A problem with local affinity parameter ν = 20 and σ_s = 0.5. First
row contains the outcome of the WJHZQ [32] algorithm. Second row contains
the outcome of our algorithm with α* = 0.24.

Now assume that the answer to 3-PARTITION is positive
and let $\{S_1, \ldots, S_m\}$ be a partition of $[3m]$, where the sum
of the $w_i$ over each $S_j$ is equal to $B$. Since $B/4 < w_i < B/2$,
each $S_j$ has exactly 3 elements. Now define the $k$-subpartition
$\mathcal{A} := \{A_1, \ldots, A_k\}$ as follows,

$A_j := \{y_j, x_i \mid i \in S_j\}$, $\forall\ 1 \le j \le m$,
$A_j := \{z_{j-m}\}$, $\forall\ m + 1 \le j \le m + t$.

Then $\mathcal{A}$ is a minimizing subpartition achieving
$\mathrm{MISO}_k(T) = 1/(B + 1)$ whose residue number is 1.
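The forward direction of the reduction can be checked numerically on a small instance. The sketch below assumes that, for a tree without potential, the cost of a subpartition is the maximum over its parts of boundary weight divided by vertex weight; the instance ($m = 2$, $B = 20$, $t = 15$) and the names `star_weights` and `cost` are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def star_weights(w, B, t):
    """Vertex weights of the star used in the reduction (centre "x")."""
    m = len(w) // 3
    weights = {"x": 1}
    weights.update({("x", i): w[i - 1] + B + 1 for i in range(1, 3 * m + 1)})
    weights.update({("y", j): 1 for j in range(1, m + 1)})
    weights.update({("z", l): B + 1 for l in range(1, t + 1)})
    return weights


def cost(subpartition, weights):
    # Assumption: cost = max over parts of (boundary weight) / (vertex weight).
    # Every part below consists of leaves of the star and all edges have
    # weight 1, so the boundary weight of a part equals its size.
    return max(Fraction(len(part), sum(weights[v] for v in part))
               for part in subpartition)


# Hypothetical yes-instance with B/4 < w_i < B/2 and total sum m * B.
w, B, t = [6, 7, 7, 6, 6, 8], 20, 15
S = [[1, 2, 3], [4, 5, 6]]                      # each triple sums to B
weights = star_weights(w, B, t)
parts = [[("y", j)] + [("x", i) for i in S[j - 1]] for j in (1, 2)]
parts += [[("z", l)] for l in range(1, t + 1)]  # singletons {z_l}
residue = set(weights) - {v for part in parts for v in part}
print(cost(parts, weights), len(residue))       # prints: 1/21 1
```

Each part built from a triple has four boundary edges and total weight $4(B + 1)$, so its ratio is exactly $1/(B + 1)$, and the only uncovered vertex is the centre $x$.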

Now, conversely, assume that there exists a minimizing
subpartition $\mathcal{B}$ achieving $\mathrm{MISO}_k(T) = 1/(B + 1)$ whose residue
number is at most 1. Then the vertex $x$ is a residue element
for $\mathcal{B}$, because for each $i$, $|B_i| \le 3m + 2$, and if $x \in B_i$,
then $\mathrm{cost}(\mathcal{B}) \ge t/(2(3m + 1)(B + 1)) > 1/(B + 1)$. Hence $\mathcal{B}$
is a partition of the set $V \setminus \{x\}$. Since $\mathrm{cost}(\mathcal{B}) = 1/(B + 1)$,
for all $1 \le i \le k$, the average of the weights of the vertices
in $B_i$ is at least $B + 1$. Now delete all vertices $z_1, \ldots, z_t$
from the $B_i$'s to obtain $m$ nonempty subsets $B'_1, \ldots, B'_m$ with
average weights of at least $B + 1$. Since the average weight
of all vertices $\{x_i, y_j \mid 1 \le i \le 3m,\ 1 \le j \le m\}$ is $B + 1$,
the average weight of each $B'_i$ is exactly $B + 1$. Now, since the
$w_i$'s are positive, each $B'_i$ contains exactly one of the vertices
$y_1, \ldots, y_m$. Hence for each $1 \le i \le m$, $\sum_{x_p \in B'_i} w_p = B$.
This finishes the reduction.